Search for: All records
Total Resources: 3
                                    Have feedback or suggestions for a way to improve these results?
 !
                                    
                                        
                                            Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher.
                                            Some full text articles may not yet be available without a charge during the embargo (administrative interval).
                                        
                                        
                                        
                                            
                                                
                                             What is a DOI Number?
                                        
                                    
                                
Some links on this page may take you to non-federal websites. Their policies may differ from this site.
- 
            Large language models’ (LLMs) abilities are drawn from their pretraining data, and model development begins with data curation. However, decisions around what data is retained or removed during this initial stage are underscrutinized. In our work, we ground web text, which is a popular pretraining data source, to its social and geographic contexts. We create a new dataset of 10.3 million self-descriptions of website creators, and extract information about who they are and where they are from: their topical interests, social roles, and geographic affiliations. Then, we conduct the first study investigating how ten “quality” and English language identification (langID) filters affect webpages that vary along these social dimensions. Our experiments illuminate a range of implicit preferences in data curation: we show that some quality classifiers act like topical domain filters, and langID can overlook English content from some regions of the world. Overall, we hope that our work will encourage a new line of research on pretraining data curation practices and their social implications.
- 
            Min, Sewon; Gururangan, Suchin; Wallace, Eric; Shi, Weijia; Hajishirzi, Hannaneh; Smith, Noah; Zettlemoyer, Luke (ICLR)
- 
            Li, Jeffrey; Fang, Alex; Smyrnis, Georgios; Ivgi, Maor; Jordan, Matt; Gadre, Samir; Bansal, Hritik; Guha, Etash; Keh, Sedrick; Arora, Kushal; et al. (https://doi.org/10.48550/arXiv.2406.11794)
            The authors introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments aimed at improving language models. DCLM provides a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants can experiment with dataset curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a baseline, the authors find that model-based filtering is critical for assembling a high-quality training set. Their resulting dataset, DCLM-Baseline, enables training a 7B parameter model from scratch to achieve 64% 5-shot accuracy on MMLU with 2.6T training tokens. This represents a 6.6 percentage point improvement over MAP-Neo (the previous state-of-the-art in open-data LMs), while using 40% less compute. The baseline model is also comparable to Mistral-7B-v0.3 and Llama 3 8B on MMLU (63% and 66%), and performs similarly on an average of 53 NLU tasks, while using 6.6x less compute than Llama 3 8B. These findings emphasize the importance of dataset design for training LMs and establish a foundation for further research on data curation.
            Free, publicly-accessible full text available April 21, 2026
 An official website of the United States government
                                     Full Text Available